Abstract
Background: Machine learning and deep learning models are increasingly used to refine prognosis in acute myeloid leukemia, but clinicians still lack a clear benchmark of real-world performance, the added value of genomics over routine clinical data, and the extent of overfitting when models leave the development cohort. We conducted a PRISMA-guided systematic review and meta-analysis to quantify discrimination at fixed prognostic horizons, estimate optimism between training and independent validation, and compare clinical or laboratory feature sets with gene-centric models to inform practical adoption in AML.
Methods: We followed PRISMA 2020. PubMed, Scopus, and Web of Science were searched from 2018 through March 2025 for AML studies that developed or externally validated AI prognostic models for overall or relapse-free survival and reported AUC with 95% confidence intervals. Two reviewers screened and extracted data with consensus resolution; risk of bias was assessed with QUIPS. Discrimination was harmonized at 1, 2, 3, and 5 years. We pooled AUCs using DerSimonian–Laird random-effects (fixed-effect for reference) and quantified heterogeneity with Q and I². Optimism was defined as the paired development minus external-validation AUC difference. Small-study effects and robustness were examined with contour-enhanced funnel plots, Begg's test, leave-one-out analyses, and trim-and-fill. Analyses were performed in R.
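As an illustration of the pooling step described above, the following is a minimal Python sketch of DerSimonian–Laird random-effects pooling with Cochran's Q and I². It is not the authors' R code. It assumes that standard errors are back-calculated from the reported 95% confidence intervals and that AUCs are pooled on their raw scale; the abstract does not state whether a transformation (for example, logit) was applied. The function names and the example values are hypothetical and are not taken from the included studies.

```python
import math


def se_from_ci(lower, upper, z=1.959964):
    """Approximate a standard error from a symmetric 95% CI (assumption)."""
    return (upper - lower) / (2 * z)


def dersimonian_laird(estimates, std_errors):
    """Pool estimates with the DerSimonian-Laird random-effects estimator.

    Returns the fixed-effect and random-effects pooled values, the
    random-effects 95% CI, Cochran's Q, tau^2, and I^2 (in percent).
    """
    k = len(estimates)
    variances = [se ** 2 for se in std_errors]

    # Fixed-effect (inverse-variance) pooling, kept for reference.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and the method-of-moments estimate of tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum_w_re
    se_pooled = math.sqrt(1.0 / sum_w_re)

    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "fixed": fixed,
        "pooled": pooled,
        "ci": (pooled - 1.959964 * se_pooled, pooled + 1.959964 * se_pooled),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical validation cohorts: (AUC, lower 95% limit, upper 95% limit).
cohorts = [(0.72, 0.66, 0.78), (0.81, 0.77, 0.85), (0.76, 0.70, 0.82)]
aucs = [auc for auc, _, _ in cohorts]
ses = [se_from_ci(lo, hi) for _, lo, hi in cohorts]
result = dersimonian_laird(aucs, ses)
print(result)
```

Under the same assumptions, optimism for a paired model would be its development AUC minus its external-validation AUC, and the resulting differences could be pooled with the same function.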
Results: We included 24 studies contributing 137 model cohorts representing about 51,055 patients. Random-effects pooling of the 73 truly independent validation cohorts yielded an overall AUC of 0.769 (95% CI 0.742 to 0.795), while the fixed-effect model gave 0.757 (95% CI 0.752 to 0.762); heterogeneity was extreme (Q 1,723.9; I² 95.7%; τ² 0.0109). Discrimination tended to improve with longer prognostic horizons: 1 year 0.748 (k=24; 95% CI 0.7393 to 0.7567), 2 years 0.760 (k=12; 95% CI 0.7471 to 0.7729), 3 years 0.760 (k=30; 95% CI 0.7525 to 0.7675), and 5 years 0.833 (k=16; 95% CI 0.8237 to 0.8423); I² exceeded 94% at each horizon. Across 53 development cohorts the pooled AUC was 0.801 (95% CI 0.792 to 0.810), versus 0.749 (95% CI 0.736 to 0.762) at external validation, yielding a mean optimism ΔAUC of 0.052 (95% CI 0.041 to 0.063; p<0.001). For five-year models the optimism gap was 0.125 (p=0.003). In modality-specific validation, non-genetic models (clinical, laboratory, flow cytometry, imaging) achieved an AUC of 0.776 (95% CI 0.745 to 0.807; I² 94.8%) versus 0.741 (95% CI 0.716 to 0.766; I² 93.5%) for gene-centric models; the difference (ΔAUC 0.035) did not reach statistical significance (p=0.085). Bias and robustness checks were reassuring: Begg's rank-correlation test showed a Kendall τ of −0.092 (p=0.14); leave-one-out meta-analyses of the 73 validation studies with reported CIs produced pooled AUCs ranging from 0.760 to 0.777; Duval–Tweedie trim-and-fill imputed five studies and shifted the pooled AUC modestly to 0.781. Methodological quality was generally acceptable across 120 QUIPS domain ratings: 74% low risk, 25% unclear, and <1% high; participant selection scored best and statistical analysis weakest.
Conclusion: Externally validated AML machine-learning prognostic models show good discrimination overall (pooled AUC ~0.77) and the strongest performance at five years (~0.83), the horizon most relevant to consolidation intensity, transplant timing, and MRD-guided maintenance. Clinical and laboratory-based models perform comparably to gene-centric signatures at validation, offering a practical, low-cost first step; genomics can be layered on to refine risk where available. Given substantial between-study heterogeneity and development-to-validation optimism, implementation should start with site-level external validation, local calibration, and pre-specified decision thresholds, followed by monitoring. These findings justify pragmatic, workflow-embedded implementation trials and give immediate guidance on where and how ML risk tools can support everyday AML decisions.